Project 4 - k-nearest neighbors

Due: Fri Feb 23, 2024 11:29am
Ungraded, 100 Possible Points
Unlimited Attempts Allowed
Available until Feb 23, 2024 11:29am

In this project you will use k-nearest neighbors to predict heart disease in patients and also to predict a variable in a dataset of your choice.

Part I

Using the Cleveland dataset (see the readme), determine an optimal value of k and an optimal set of attributes to use to maximize predictive power in predicting whether a patient has heart disease. (The dataset has a "num" attribute. You will want to collapse all values 1-4 into a single value so that "num" is boolean.) Report your prediction results using the precision, recall, and F1 metrics (list all ten scores). Also report your k value and the set of attributes you are using. You must use sklearn.model_selection.train_test_split to split your data randomly, and you must use at least 10 iterations of cross-validation. For simplicity, you may use Monte Carlo cross-validation; for just a bit more of a challenge, you can use 10-fold cross-validation. See this tutorial for more information. For your chosen approach to model fitting, report all precision/recall/F1 scores as well as the mean F1 score. You may report additional scores for other models if it sheds light on your final result.
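As a rough sketch of the mechanics (not a required implementation), collapsing "num" and running Monte Carlo cross-validation might look like the following. The `predict_fn` argument is a stand-in for whatever kNN predictor you write yourself; the split size and seeds are arbitrary choices here:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

def collapse_num(y):
    """Collapse num values 1-4 into 1 so the target is boolean (0 = no disease)."""
    return (np.asarray(y) > 0).astype(int)

def monte_carlo_cv(X, y, predict_fn, n_iter=10, test_size=0.25, seed=0):
    """Run n_iter random train/test splits and collect (precision, recall, F1).

    predict_fn(X_train, y_train, X_test) -> predicted labels; you supply
    your own kNN implementation here.
    """
    scores = []
    for i in range(n_iter):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed + i)
        y_pred = predict_fn(X_tr, y_tr, X_te)
        p, r, f1, _ = precision_recall_fscore_support(
            y_te, y_pred, average="binary", zero_division=0)
        scores.append((p, r, f1))
    return scores
```

Listing the F1 value from each of the 10 iterations, plus their mean, satisfies the "all ten scores" requirement above.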

As an example, suppose I think that using k=5 and four attributes is a good idea. So I write a function that chooses 4 attributes in a meaningful way based on a training dataset. For the 10-fold cross-validation, I call that function 10 times with different training datasets, and get back 10 different models. I test each of those models against the test dataset and get precision/recall/F1 scores. Suppose I'm not happy with the scores. Then I can modify my function to choose, say, 5 attributes, or, as another approach, any number of attributes as long as they're meaningful, or, as another approach, use 3 hand-picked attributes, etc. I would then run the cross-validation again.
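The 10-fold workflow in that example can be sketched as below. `choose_attributes` and `fit_and_score` are hypothetical stand-ins for the selection function and the model-testing step you write; the point is that attribute selection happens on each fold's training portion only:

```python
import numpy as np

def kfold_indices(n, n_folds=10, seed=0):
    """Shuffle 0..n-1 and split into n_folds roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), n_folds)

def run_kfold(X, y, choose_attributes, fit_and_score, n_folds=10):
    """For each fold: select attributes using only the training rows,
    then score against the held-out fold. Returns one score per fold."""
    X, y = np.asarray(X), np.asarray(y)
    scores = []
    for fold in kfold_indices(len(X), n_folds):
        train = np.ones(len(X), dtype=bool)
        train[fold] = False
        cols = choose_attributes(X[train], y[train])  # e.g. returns a list of column indices
        scores.append(fit_and_score(X[train][:, cols], y[train],
                                    X[fold][:, cols], y[fold]))
    return scores
```

If the 10 scores are unsatisfying, only `choose_attributes` needs to change between runs, exactly as the example describes.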

Note: you may not use the KNeighborsClassifier tool in sklearn. You may use the GridSearchCV function.
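Since KNeighborsClassifier is off the table, the core predictor has to be written by hand. A minimal sketch with Euclidean distance and an unweighted majority vote (feature scaling and tie-breaking are left to you):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Predict each test row's label by majority vote among its
    k nearest training rows, using Euclidean distance."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=int)
    preds = []
    for x in np.asarray(X_test, dtype=float):
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = y_train[np.argsort(dists)[:k]]   # labels of the k closest rows
        preds.append(np.bincount(nearest).argmax())  # majority vote
    return np.array(preds)
```

Consider scaling each attribute (e.g., z-scoring each column) before computing distances; otherwise attributes on large numeric scales dominate the distance.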

At the beginning of class on the day the project is due, I will release a "challenge test dataset". Each team will predict on each patient in the test dataset and compute their F1 score. The team with the highest F1 score will receive 10 extra points. To make sure you're ready for the challenge, do a dry run on this sample test dataset (the actual challenge test dataset will have the same columns). Note that you will not use cross-validation on the challenge dataset; you will simply predict on each patient and compute an F1 score. The neighbors will come from the original dataset, not the challenge dataset, since the original dataset has many more patients. So it's recommended that you read each patient in the challenge dataset, find that patient's neighbors in the original dataset, and make a prediction; then compute the F1 score over all your predictions.
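For the dry run, the evaluation might look like the sketch below. `predict_fn` stands in for your own Part I predictor; loading the files and aligning the columns is up to you:

```python
import numpy as np
from sklearn.metrics import f1_score

def evaluate_challenge(X_orig, y_orig, X_chal, y_chal, predict_fn):
    """No cross-validation here: predict every challenge patient using
    neighbors drawn from the ORIGINAL dataset, then compute one F1 score."""
    y_pred = predict_fn(np.asarray(X_orig, dtype=float), np.asarray(y_orig),
                        np.asarray(X_chal, dtype=float))
    return f1_score(np.asarray(y_chal), y_pred)
```

The key point the sketch encodes is that the original dataset plays the role of the training set for every challenge patient.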

Part II

Find a dataset (this site has some nice ones; see here for other resources) or create your own to try kNN out on. Again, report your value of k, the attributes you are using, and your prediction results.

Write-up

Your write-up should have the following structure:

  • Part I
    • Introduction
    • Methods - how you determined what k and what attributes to use
    • Results
  • Part II
    • Introduction
    • Dataset - where you got the dataset, what cleaning you did, etc.
    • Methods - include this section only if you did something different from Part I
    • Results

Project 4 rubric

Criteria             Ratings                                 Pts
Part I               Full Marks (45 pts) / No Marks (0 pts)  45 pts
Part II              Full Marks (50 pts) / No Marks (0 pts)  50 pts
Presentation Slides  Full Marks (5 pts) / No Marks (0 pts)   5 pts

Total Points: 100
Keep in mind, this submission will count for everyone in your project group.
